
In today's dynamic workplace, understanding and anticipating the factors that influence employee retention is crucial for organizational success. Using analytics and machine learning, this project aims to uncover the patterns and indicators that contribute to turnover, providing insights for proactive talent management and a more resilient, engaged workforce. Join us in harnessing data to improve employee retention strategies and organizational performance.
An employee quitting their job can have a negative impact on the workplace, reducing efficiency and productivity. This study investigates the factors that contribute to employees leaving their jobs. These range from workplace reasons, such as a negative environment that leaves an employee feeling undervalued and unappreciated, to personal ones, such as work-life imbalance or the desire to pursue a different career path. Understanding these reasons is crucial for employers so that they can take appropriate measures to create a better workplace for their employees. To this end, we gathered a relevant data set from Kaggle containing information about employees. By studying this data, we aim to build a model that accurately predicts whether an employee is likely to quit.
Table of Contents¶
Methodology to analyse each feature¶
Overview¶
Catalog of used functions' implementations¶
Considering Technical Reasons¶
Department¶
Job Roles/Levels¶
Monthly Income¶
Experience¶
Over Time¶
Level of Satisfaction¶
Considering Personal Reasons¶
Age¶
Gender¶
Educational background¶
Marital Status¶
Distance From Home¶
WorkLife Balance¶
Statistical Analysis¶
Correlation Matrix¶
Point Biserial¶
T Test summary¶
Chi Square ( $\chi^2$ ) Test summary¶
Predictive Model (Logistic regression)¶
Bootstrap Confidence interval¶
Set Risk Level and Attrition Likelihood Features¶
Feature Importance according to model¶
Interactive dashboard that predicts Risk/Likelihood¶
Conclusions & Tips for company¶
References¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.gridspec as gridspec
import scikit_posthocs as sp
import statsmodels.api as sm
from ipywidgets import Dropdown, IntSlider, interact
from scipy import stats
from scipy.stats import t
from plotly.subplots import make_subplots
from category_encoders import OneHotEncoder
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay, accuracy_score, classification_report, roc_auc_score, precision_score , recall_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
df = pd.read_csv("Employee-Attrition.csv")
df.head()
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
5 rows × 35 columns
def ReplaceIt(i, st=""):
    df[st] = df[st].replace(1 - i, 'Low')
    df[st] = df[st].replace(2 - i, 'Medium')
    df[st] = df[st].replace(3 - i, 'High')
    df[st] = df[st].replace(4 - i, 'Very High')

CatList = ["EnvironmentSatisfaction", "JobSatisfaction", "RelationshipSatisfaction", "WorkLifeBalance", "StockOptionLevel"]
i = 0
for st in CatList:
    if st == "StockOptionLevel":  # StockOptionLevel is coded 0-3, so shift the mapping by one
        i = i + 1
    ReplaceIt(i, st)
df["JobLevel"] = df["JobLevel"].replace(
{1:"Entry Level",
2:"Junior Level",
3:"Mid Level",
4:"Senior Level",
5:"Executive Level"}
)
df["Education"] = df["Education"].replace(
{1:"Below College",
2:"College",
3:"Bachelor",
4:"Master",
5:"Doctor"}
)
df['PerformanceRating']=df['PerformanceRating'].replace(3,'Excellent')
df['PerformanceRating']=df['PerformanceRating'].replace(4,'Outstanding')
Our methodology to analyse features:¶
- ### Take an overview of our data (check for nulls, duplicates, etc., and explore the central tendency and variability of each feature).
- ### Divide features into technical features and personal ones.
- ### Formulate our hypotheses (Null $H_0$: the feature has no effect on attrition, and the groups it defines are independent of each other).
- ### For each feature, use the appropriate statistical test to check our hypothesis:
T-test for two groups of a numerical variable.
ANOVA if there are more than two groups.
Post-hoc tests to identify exactly which pairs of groups differ significantly (if needed).
Chi-square ($\chi^2$) test for two categorical variables.
- ### Utilize visual tools like histograms, box plots, scatter plots, and pie charts to uncover patterns or connections between these features and attrition (or other features), according to our testing needs.
- ### Then we incorporate our observations based on these tests and visualizations.
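The test-selection rule in the steps above can be sketched as a tiny helper (illustrative only — `choose_test` is a hypothetical name and is not used elsewhere in this notebook):

```python
# Hypothetical helper illustrating the test-selection rule described above;
# it is not part of the analysis code that follows.
def choose_test(feature_is_numeric, n_groups):
    """Return the statistical test the methodology prescribes."""
    if feature_is_numeric:
        if n_groups == 2:
            return "t-test"
        return "ANOVA (+ post-hoc t-tests if needed)"
    return "chi-square test"

print(choose_test(True, 2))    # numeric feature, two groups
print(choose_test(True, 3))    # numeric feature, more than two groups
print(choose_test(False, 4))   # categorical feature
```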
df.describe()
| Age | DailyRate | DistanceFromHome | EmployeeCount | EmployeeNumber | HourlyRate | JobInvolvement | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | StandardHours | TotalWorkingYears | TrainingTimesLastYear | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.000000 | 1470.000000 | 1470.000000 | 1470.0 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.0 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 |
| mean | 36.923810 | 802.485714 | 9.192517 | 1.0 | 1024.865306 | 65.891156 | 2.729932 | 6502.931293 | 14313.103401 | 2.693197 | 15.209524 | 80.0 | 11.279592 | 2.799320 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.509100 | 8.106864 | 0.0 | 602.024335 | 20.329428 | 0.711561 | 4707.956783 | 7117.786044 | 2.498009 | 3.659938 | 0.0 | 7.780782 | 1.289271 | 6.126525 | 3.623137 | 3.222430 | 3.568136 |
| min | 18.000000 | 102.000000 | 1.000000 | 1.0 | 1.000000 | 30.000000 | 1.000000 | 1009.000000 | 2094.000000 | 0.000000 | 11.000000 | 80.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 30.000000 | 465.000000 | 2.000000 | 1.0 | 491.250000 | 48.000000 | 2.000000 | 2911.000000 | 8047.000000 | 1.000000 | 12.000000 | 80.0 | 6.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 |
| 50% | 36.000000 | 802.000000 | 7.000000 | 1.0 | 1020.500000 | 66.000000 | 3.000000 | 4919.000000 | 14235.500000 | 2.000000 | 14.000000 | 80.0 | 10.000000 | 3.000000 | 5.000000 | 3.000000 | 1.000000 | 3.000000 |
| 75% | 43.000000 | 1157.000000 | 14.000000 | 1.0 | 1555.750000 | 83.750000 | 3.000000 | 8379.000000 | 20461.500000 | 4.000000 | 18.000000 | 80.0 | 15.000000 | 3.000000 | 9.000000 | 7.000000 | 3.000000 | 7.000000 |
| max | 60.000000 | 1499.000000 | 29.000000 | 1.0 | 2068.000000 | 100.000000 | 4.000000 | 19999.000000 | 26999.000000 | 9.000000 | 25.000000 | 80.0 | 40.000000 | 6.000000 | 40.000000 | 18.000000 | 15.000000 | 17.000000 |
print(df.shape)
print(df.duplicated().sum())
table = pd.DataFrame({
    'Unique': df.nunique(),
    'Null': df.isna().sum(),
    'Type': df.dtypes.values
})
display(table)
(1470, 35)
0
| Unique | Null | Type | |
|---|---|---|---|
| Age | 43 | 0 | int64 |
| Attrition | 2 | 0 | object |
| BusinessTravel | 3 | 0 | object |
| DailyRate | 886 | 0 | int64 |
| Department | 3 | 0 | object |
| DistanceFromHome | 29 | 0 | int64 |
| Education | 5 | 0 | object |
| EducationField | 6 | 0 | object |
| EmployeeCount | 1 | 0 | int64 |
| EmployeeNumber | 1470 | 0 | int64 |
| EnvironmentSatisfaction | 4 | 0 | object |
| Gender | 2 | 0 | object |
| HourlyRate | 71 | 0 | int64 |
| JobInvolvement | 4 | 0 | int64 |
| JobLevel | 5 | 0 | object |
| JobRole | 9 | 0 | object |
| JobSatisfaction | 4 | 0 | object |
| MaritalStatus | 3 | 0 | object |
| MonthlyIncome | 1349 | 0 | int64 |
| MonthlyRate | 1427 | 0 | int64 |
| NumCompaniesWorked | 10 | 0 | int64 |
| Over18 | 1 | 0 | object |
| OverTime | 2 | 0 | object |
| PercentSalaryHike | 15 | 0 | int64 |
| PerformanceRating | 2 | 0 | object |
| RelationshipSatisfaction | 4 | 0 | object |
| StandardHours | 1 | 0 | int64 |
| StockOptionLevel | 4 | 0 | object |
| TotalWorkingYears | 40 | 0 | int64 |
| TrainingTimesLastYear | 7 | 0 | int64 |
| WorkLifeBalance | 4 | 0 | object |
| YearsAtCompany | 37 | 0 | int64 |
| YearsInCurrentRole | 19 | 0 | int64 |
| YearsSinceLastPromotion | 16 | 0 | int64 |
| YearsWithCurrManager | 18 | 0 | int64 |
df.drop(columns=['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber'], inplace=True)
df.hist(figsize=(15,15))
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(13, 6))
sns.countplot(x="Attrition", data=df, ax=axes[0])
for container in axes[0].containers:
    axes[0].bar_label(container)
axes[1].pie(df["Attrition"].value_counts().values, labels=df["Attrition"].value_counts().index, autopct='%1.2f%%', explode=[0, 0.1])
axes[1].legend()
plt.suptitle("Overall attrition", fontsize=20)
Text(0.5, 0.98, 'Overall attrition')
# Apply a t-test on column "val" between the two groups defined by the "group" column
def TTest(val="", group=""):
    uni = df[group].unique()
    result = stats.ttest_ind(df[df[group] == uni[0]][val], df[df[group] == uni[1]][val])
    # result.df (degrees of freedom) is available in SciPy >= 1.10
    return result.statistic, stats.t.ppf(1 - .05 / 2, result.df), result.pvalue
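`TTest` relies on `scipy.stats.ttest_ind` with its default `equal_var=True` (Student's pooled-variance t-test). As a side note, the Welch variant (`equal_var=False`) is a common robustness check when group variances differ; the two samples below are synthetic, not from the dataset:

```python
import numpy as np
from scipy import stats

# Synthetic two-group comparison: Student's pooled-variance t-test (the
# default used by TTest above) next to Welch's t-test for unequal variances.
rng = np.random.default_rng(1)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=0.3, scale=2.0, size=120)

student = stats.ttest_ind(a, b)                  # assumes equal variances
welch = stats.ttest_ind(a, b, equal_var=False)   # Welch's correction
print(student.pvalue, welch.pvalue)
```

When group sizes and variances are similar (as is roughly the case for most splits here), the two tests agree closely; they diverge mainly for unbalanced groups with unequal spread.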
# Apply the t-test to every numeric column, between the two groups defined by "group"
def ApplyTTest(group=""):
    T_Score = {}
    CriticalT = {}
    p_values = {}
    for col in df.select_dtypes("number").columns:
        T_Score[col], CriticalT[col], p_values[col] = TTest(col, group)
    test_df = pd.DataFrame(
        {"Features": list(T_Score.keys()),
         "T Value": list(T_Score.values()),
         "Critical Value": list(CriticalT.values())}
    )
    test_df["P_value"] = [format(p, '.20f') for p in p_values.values()]
    test_df["P_value"] = test_df["P_value"].astype(float)
    test_df["Result"] = test_df["P_value"].map(lambda x: "Accept" if x > 0.05 else "Reject")
    return test_df, p_values
# Calculate the maximum estimation error of "num" within each group of "cat"
def CalcMEE(cat="", num="", t=1.96):
    # t = 1.96 -> 0.95 CI (recomputed per group below from the exact t-distribution)
    n = {}
    std = {}
    EstError = {}
    Average = {}
    for col in df[cat].unique():
        group = df[df[cat] == col][num]
        # drop outliers beyond 1.5 * IQR
        Q1, Q3 = group.quantile([0.25, 0.75])
        IQR = Q3 - Q1
        upper = Q3 + 1.5 * IQR
        lower = Q1 - 1.5 * IQR
        group = group[group.between(lower, upper)]
        n[col] = len(group)
        std[col] = group.std()
        Average[col] = group.mean()
        t = stats.t.ppf(q=1 - .05 / 2, df=len(group) - 1)
        EstError[col] = t * std[col] / np.sqrt(n[col])
    E = list(EstError.values())
    test_df = pd.DataFrame(
        {cat: list(n.keys()),
         "Total number": list(n.values()),
         "Standard Deviation": list(std.values()),
         "Mean": list(Average.values()),
         "Max Error": E,
         }
    )
    test_df["Max Error"] = ["+/- " + str(round(e, 4)) for e in E]
    return test_df
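As a sanity check on the margin-of-error formula `CalcMEE` uses ($E = t_{crit} \cdot s / \sqrt{n}$), the half-width of scipy's t confidence interval should match it exactly. A sketch with synthetic data (not from the dataset):

```python
import numpy as np
from scipy import stats

# Verify E = t_crit * s / sqrt(n) against the half-width of scipy's t-interval.
rng = np.random.default_rng(0)
sample = rng.normal(loc=5000, scale=300, size=50)  # synthetic "income" sample

n = len(sample)
mean = sample.mean()
s = sample.std(ddof=1)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)
E = t_crit * s / np.sqrt(n)

lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=stats.sem(sample))
assert np.isclose(hi - mean, E)  # CI half-width equals the "Max Error" value
```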
# Same as CalcMEE but also splits each group by Attrition
def SpecialCalcMEE(cat="", num="", t=1.96):
    # t = 1.96 -> 0.95 CI (recomputed per group below from the exact t-distribution)
    n = {}
    std = {}
    EstError = {}
    Average = {}
    columns = []
    for col in df[cat].unique():
        for att in ["Yes", "No"]:
            sub = df[df["Attrition"] == att]
            group = sub[sub[cat] == col][num]
            colname = str(col) + "-" + att
            columns.append(colname)
            # drop outliers beyond 1.5 * IQR
            Q1, Q3 = group.quantile([0.25, 0.75])
            IQR = Q3 - Q1
            group = group[group.between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)]
            n[colname] = len(group)
            std[colname] = group.std()
            Average[colname] = group.mean()
            t = stats.t.ppf(q=1 - .05 / 2, df=len(group) - 1)
            EstError[colname] = t * std[colname] / np.sqrt(n[colname])
    E = list(EstError.values())
    test_df = pd.DataFrame(
        {cat: columns,
         "Total number": list(n.values()),
         "Standard Deviation": list(std.values()),
         "Mean": list(Average.values()),
         "Max Error": E,
         }
    )
    test_df["Max Error"] = ["+/- " + str(round(e, 4)) for e in E]
    return test_df
# Plot attrition counts for one department as a bar chart and a pie chart
def Plot(dep=""):
    Department = df.groupby(['Department', 'Attrition'])["Age"].count().reset_index(name='Counts')
    Department = Department[Department["Department"] == dep]
    fig, axes = plt.subplots(1, 2, figsize=(13, 6))
    sns.barplot(x="Attrition", y="Counts", data=Department, ax=axes[0])
    for container in axes[0].containers:
        axes[0].bar_label(container)
    axes[1].pie(Department["Counts"], labels=Department["Attrition"].value_counts().index, autopct='%1.2f%%', explode=[0, 0.1])
    axes[1].legend()
    plt.suptitle(dep, fontsize=20)
# Chi-square test of independence between two given categorical features
def ChiSquared(cat1="", cat2="Attrition"):
    contingency_table = pd.crosstab(df[cat1], df[cat2])
    chi2, p_value, deg, _ = stats.chi2_contingency(contingency_table)
    CriticalChi = stats.chi2.ppf(0.95, deg)
    return chi2, CriticalChi, p_value
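Under the hood, the statistic `ChiSquared` returns is $\sum (O - E)^2 / E$ over the contingency table. A sketch with an example 2×2 table of counts (note: for 2×2 tables scipy applies Yates' continuity correction by default, so `correction=False` is needed for the manual formula to match exactly):

```python
import numpy as np
from scipy import stats

# Manual chi-square statistic vs scipy, on an example 2x2 table.
observed = np.array([[944, 110],
                     [289, 127]])
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row * col / observed.sum()           # counts expected under independence
chi2_manual = ((observed - expected) ** 2 / expected).sum()

chi2_scipy, p, dof, _ = stats.chi2_contingency(observed, correction=False)
assert np.isclose(chi2_manual, chi2_scipy)
assert dof == 1                                  # (rows - 1) * (cols - 1)
```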
# Plot the p-values resulting from ANOVA or chi-square tests, sorted in descending order
def plotP(p_values, st="", ISchi=False, tit=""):
    plt.figure(figsize=(12, 6))
    keys = list(p_values.keys())
    values = list(p_values.values())
    sorted_value_index = list(reversed(np.argsort(values)))
    keys = [keys[i] for i in sorted_value_index]
    values = [values[i] for i in sorted_value_index]
    sns.barplot(x=keys, y=values, palette='rocket')
    if tit == "":
        if ISchi:
            plt.title("P_score Comparison according to the Chi Square test between Categorical variables and Attrition")
        else:
            plt.title(f"ANOVA Test P_score Comparison within {st} Groups")
    else:
        plt.title(tit)
    plt.axhline(y=0.05, color='red', linestyle="--", label="P_value = 0.05")
    plt.legend()
    plt.xticks(rotation=90)
    for index, value in enumerate(values):
        plt.text(index, value, round(value, 3), ha="center", va="bottom")
    plt.show()
# One-way ANOVA of the numeric column "num" across the groups of the categorical feature "st"
def SingleANOVA(st="", num=""):
    Arg = []
    for col in df[st].unique():
        Arg.append(df[num][df[st] == col])
    f_score, p_value = stats.f_oneway(*Arg)
    dof_within = len(df) - len(Arg)
    dof_between = len(Arg) - 1
    CriticalF = stats.f.ppf(1 - 0.05, dof_between, dof_within)
    return f_score, p_value, CriticalF
# Same as SingleANOVA but the groups are split by Attrition as well as by "st"
def SpecialSingleANOVA(st="", num=""):
    Arg = []
    Att = df[df["Attrition"] == "Yes"]
    for col in Att[st].unique():
        Arg.append(Att[num][Att[st] == col])
    NoAtt = df[df["Attrition"] == "No"]
    for col in NoAtt[st].unique():
        Arg.append(NoAtt[num][NoAtt[st] == col])
    f_score, p_value = stats.f_oneway(*Arg)
    dof_within = len(df) - len(Arg)
    dof_between = len(Arg) - 1
    CriticalF = stats.f.ppf(1 - 0.05, dof_between, dof_within)
    return f_score, p_value, CriticalF
# Apply the ANOVA test to every numeric column across the groups of "st", then plot the p-values
def ANOVA(st=""):
    f_scores = {}
    p_values = {}
    CriticalF = {}
    for num in df.select_dtypes("number").columns:
        f_scores[num], p_values[num], CriticalF[num] = SingleANOVA(st, num)
    plotP(p_values, st)
    return f_scores, p_values
# Same as ANOVA but the groups are split by Attrition as well as by "st"
def SpecialANOVA(st=""):
    f_scores = {}
    p_values = {}
    CriticalF = {}
    for num in df.select_dtypes("number").columns:
        f_scores[num], p_values[num], CriticalF[num] = SpecialSingleANOVA(st, num)
    plotP(p_values, st)
    return f_scores, p_values
# Error-bar plot of the given p-values around p = 0.5
def PlotErrors(p_values, st="", ISchi=False):
    plt.figure(figsize=(12, 6))
    x = list(p_values.keys())
    y = list(p_values.values())
    y_errormin = len(x) * [0.5]
    y_errormax = len(x) * [0.45]
    y_error = [y_errormin, y_errormax]
    plt.errorbar(x, y, yerr=y_error, fmt='o', capsize=10)
    plt.xticks(rotation=90)
    plt.axhline(y=0.5, color='red')
    if ISchi:
        plt.title("Chi Square P_score Error bar between Categorical features and Attrition")
    else:
        plt.title(f"ANOVA Test P_score Comparison within {st} Groups")
# Bar plot of the chi-square statistics, sorted in descending order
def PlotChi(keys, values):
    plt.figure(figsize=(12, 6))
    sorted_value_index = list(reversed(np.argsort(values)))
    keys = [keys[i] for i in sorted_value_index]
    values = [values[i] for i in sorted_value_index]
    sns.barplot(x=keys, y=values, palette="mako")
    plt.xticks(rotation=90)
    plt.title("Chi2 Statistic Value of each Categorical Column with Attrition", fontweight="black")
    for index, value in enumerate(values):
        plt.text(index, value, round(value, 2), ha="center", va="bottom")
    plt.show()
# Add a percentage column to the given grouped DataFrame, for annotating bar plots;
# rows are assumed to come in (No, Yes) pairs per category
def AddPercentage(Df, st=""):
    length = len(Df[st])
    Df["percent"] = length * [""]
    i = 0
    while i < length:
        Sum = Df.iloc[i, 2] + Df.iloc[i + 1, 2]
        prc1 = round((Df.iloc[i, 2] / Sum) * 100, 2)
        Df.iloc[i, 3] = str(prc1) + ' %'
        prc2 = round((Df.iloc[i + 1, 2] / Sum) * 100, 2)
        Df.iloc[i + 1, 3] = str(prc2) + ' %'
        i = i + 2
    return Df
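An aside on the design: the same percentage column can be built without the row-pair loop via a groupby-transform (the tiny frame below is invented for illustration; it mimics the grouped frames passed to `AddPercentage`):

```python
import pandas as pd

# Vectorized alternative to AddPercentage's row-pair loop: each count's
# share within its category via groupby-transform.
Df = pd.DataFrame({
    "JobRole": ["A", "A", "B", "B"],
    "Attrition": ["No", "Yes", "No", "Yes"],
    "Counts": [80, 20, 30, 70],
})
share = Df["Counts"] / Df.groupby("JobRole")["Counts"].transform("sum")
Df["percent"] = (share * 100).round(2).astype(str) + " %"
print(Df["percent"].tolist())  # ['80.0 %', '20.0 %', '30.0 %', '70.0 %']
```

Unlike the loop, this does not rely on the rows arriving in (No, Yes) pairs.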
# Plot four side-by-side pie charts of attrition, one per satisfaction level
def PlotPies(st=""):
    bus = df.groupby([st, 'Attrition'], as_index=False)['Age'].count()
    bus.rename(columns={'Age': 'Count'}, inplace=True)
    levels = ('Very High', 'High', 'Medium', 'Low')
    fig = make_subplots(rows=1, cols=4,
                        specs=[[{"type": "pie"}] * 4],
                        subplot_titles=levels)
    for i, level in enumerate(levels):
        fig.add_trace(go.Pie(values=bus[bus[st] == level]['Count'],
                             labels=bus[bus[st] == level]['Attrition'],
                             pull=[0, 0.1], showlegend=(level == 'Low')),
                      row=1, col=i + 1)
    fig.update_layout(title_x=0.5, template='simple_white', showlegend=True,
                      legend_title_text="Attrition",
                      title_text=f"<b style='color:black; font-size:100%;'>Employee Attrition based on {st}",
                      font_family="Times New Roman", title_font_family="Times New Roman")
    fig.show()
for dep in set(df["Department"]):
    Plot(dep)
Null Hypothesis $H_0$: Department has no effect on the decision to quit.
Alternative Hypothesis $H_a$: Department has a significant effect on the decision to quit.
pd.crosstab(df['Department'], df['Attrition'])#.reset_index()
| Attrition | No | Yes |
|---|---|---|
| Department | ||
| Human Resources | 51 | 12 |
| Research & Development | 828 | 133 |
| Sales | 354 | 92 |
chi2,CriticalChi, p_value = ChiSquared("Department")
$X^2 = 10.796$
$X^2_c = 5.991$
$Pvalue = 0.0045$
Since $X^2 > X^2_c$ and $P < 0.05$, we reject our Null Hypothesis and accept the alternative: "Department has a significant effect on the decision to quit".
ANOVA test.
Null hypothesis ($H_0$): There is no significant difference among the groups.¶
Alternative hypothesis ($H_a$): There is a significant difference among the groups.¶
f_scores,p_values = ANOVA("Department")
We reject our null hypothesis for Monthly income, as the test shows a significant difference between the department groups.
But which pairs of groups differ significantly in Monthly income?
To find out, let's run a post-hoc test based on pairwise t-tests.
sp.posthoc_ttest(df, val_col='MonthlyIncome', group_col='Department')
| Sales | Research & Development | Human Resources | |
|---|---|---|---|
| Sales | 1.000000 | 0.010998 | 0.599484 |
| Research & Development | 0.010998 | 1.000000 | 0.562536 |
| Human Resources | 0.599484 | 0.562536 | 1.000000 |
There is a significant difference in Monthly income between the Sales and Research & Development departments. Our EDA shows that Sales has the highest attrition rate and Research & Development the lowest, so the salary gap between departments may be contributing to employee turnover.
Recap:¶
- Sales department has the highest attrition rate.
- Research & Development department has the highest number of employees and the lowest attrition rate.
- Human Resources has the lowest number of employees.
- Department has a significant effect on the decision to quit.
- There is a significant difference in Monthly income between department groups, especially between Sales and Research & Development.
jobRole = df.groupby(['JobRole', 'Attrition'])["Age"].count().reset_index(name='Counts')
jobRole = AddPercentage(jobRole,"JobRole")
px.bar(jobRole, y ="Counts", x ="JobRole",color="Attrition", text='percent')
ChiSquared("JobRole")
(86.19025367670434, 15.50731305586545, 2.752481638050657e-15)
$X^2 = 86.1903$
$X^2_c = 15.5073$
$Pvalue = 2.7525 \times 10^{-15}$
Since $X^2 >> X^2_c$ and $P << 0.05$, we reject our Null Hypothesis and accept the alternative: "JobRole has a significant effect on the decision to quit".
byJobLevel = df.groupby(['JobLevel', 'Attrition'])["Age"].count().reset_index(name='Counts')
byJobLevel = AddPercentage(byJobLevel,"JobLevel")
order = ["Entry Level","Junior Level","Mid Level","Senior Level","Executive Level"]
px.bar(byJobLevel, y ="Counts", x ="JobLevel",color="Attrition", text='percent',category_orders= {'JobLevel':order})
ChiSquared("JobLevel")
(72.52901310667391, 9.487729036781154, 6.634684715458909e-15)
$X^2 = 72.5290$
$X^2_c = 9.4877$
$Pvalue = 6.6347 \times 10^{-15}$
Since $X^2 >> X^2_c$ and $P << 0.05$, we reject our Null Hypothesis and accept the alternative: "JobLevel has a significant effect on the decision to quit".
f_scores, p_values =ANOVA("JobRole")
f_scores, p_values =ANOVA("JobLevel")
The tests show a significant difference between groups from different job roles/levels in features such as:¶
["Age", "MonthlyIncome", "NumCompaniesWorked", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager"]
px.box(df, x="JobLevel", y="Age", title="Age distribution by different job levels")
Recap:¶
- Most employees in this organization work as Sales Executive, Research Scientist, or Laboratory Technician.
- The highest attrition rates are among Research Directors, Sales Executives, and Research Scientists.
- The Laboratory Technician and Human Resources job roles also have high attrition rates.
- The highest attrition rates are at the Entry and Mid job levels.
- Employees at high job levels, such as Senior and Executive level, tend to stay with the company longer than others.
- There is a significant difference between job roles/levels in monthly income and in time- and experience-related features such as age and total working years.
- JobRole and JobLevel each have a significant effect on the decision to quit.
plt.figure(figsize=(13,6))
plt.subplot(1,2,1)
sns.histplot(x="MonthlyIncome", hue="Attrition", kde=True,data=df )
plt.title("Employee Attrition by Monthly Income")
plt.subplot(1,2,2)
sns.boxplot(x="Attrition",y="MonthlyIncome",data=df)
plt.title("Employee Attrition by Monthly Income")
plt.tight_layout()
plt.show()
TTest("MonthlyIncome", "Attrition")
(-6.203935765608938, 1.9615812836543436, 7.14736398535381e-10)
T Value = $-6.204$
Critical value = $\pm 1.96$
Pvalue = $7.1474 \times 10^{-10}$
Since $|T| > T_c$ and $P << 0.05$, we accept the Alternative Hypothesis: "Monthly income has a significant effect on attrition".
ANOVA test to find which categorical variables' groups differ in Monthly income¶
p_values = {}
for st in df.select_dtypes("object").columns:
    _, p_values[st], _ = SingleANOVA(st, "MonthlyIncome")
plotP(p_values, tit="p values from ANOVA tests between groups based on Monthly income")
SingleANOVA("Department", "MonthlyIncome")
(3.201782929420171, 0.04097409724987449, 3.001858137213299)
$F = 3.2018$
$F_c = 3.0019$
$Pvalue = 0.04097$
Since $F > F_c$ and $P < 0.05$, Monthly income differs significantly across departments.
sp.posthoc_ttest(df, val_col='MonthlyIncome', group_col='Department').apply(lambda x:round(x, 5))
| Sales | Research & Development | Human Resources | |
|---|---|---|---|
| Sales | 1.00000 | 0.01100 | 0.59948 |
| Research & Development | 0.01100 | 1.00000 | 0.56254 |
| Human Resources | 0.59948 | 0.56254 | 1.00000 |
df['Department'].unique()
array(['Sales', 'Research & Development', 'Human Resources'], dtype=object)
fig1=px.box( df,x="Department" ,y="MonthlyIncome")
fig2=px.box( df,x="Department" ,y="MonthlyIncome", color="Attrition")
fig1.show()
fig2.show()
CalcMEE("Department", "MonthlyIncome")  # 95% CI
| Department | Total number | Standard Deviation | Mean | Max Error | |
|---|---|---|---|---|---|
| 0 | Sales | 415 | 2867.500943 | 6153.732530 | +/- 276.6936 |
| 1 | Research & Development | 869 | 3146.772819 | 5027.997699 | +/- 209.5123 |
| 2 | Human Resources | 55 | 3559.526833 | 4864.018182 | +/- 962.2749 |
result= SpecialCalcMEE("Department","MonthlyIncome" )
errors = [float(x[3:]) for x in result["Max Error"]]
px.bar(result, x="Department", y="Mean", error_y= errors, title="Bar Plot of mean Monthly income with max error of estimation")
px.box( df,x="JobLevel" ,y="MonthlyIncome",category_orders= {'JobLevel':order})
CalcMEE("JobLevel","MonthlyIncome" )
| JobLevel | Total number | Standard Deviation | Mean | Max Error | |
|---|---|---|---|---|---|
| 0 | Junior Level | 513 | 1161.309722 | 5334.621832 | +/- 100.7315 |
| 1 | Entry Level | 526 | 665.460059 | 2721.863118 | +/- 57.0006 |
| 2 | Mid Level | 218 | 1805.999233 | 9817.252294 | +/- 241.0828 |
| 3 | Senior Level | 106 | 1816.239003 | 15503.783019 | +/- 349.7859 |
| 4 | Executive Level | 69 | 512.383127 | 19191.826087 | +/- 123.0879 |
result = SpecialCalcMEE("JobLevel","MonthlyIncome" )
errors = [float(x[3:]) for x in result["Max Error"]]
px.bar(result, x="JobLevel", y="Mean", error_y= errors,title="Bar Plot of mean Monthly income with max error of estimation")
sp.posthoc_ttest(df, val_col='MonthlyIncome', group_col='JobRole').apply(lambda x:round(x, 5))
| Sales Executive | Research Scientist | Laboratory Technician | Manufacturing Director | Healthcare Representative | Manager | Sales Representative | Research Director | Human Resources | |
|---|---|---|---|---|---|---|---|---|---|
| Sales Executive | 1.00000 | 0.00000 | 0.00000 | 0.13262 | 0.01607 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| Research Scientist | 0.00000 | 1.00000 | 0.97773 | 0.00000 | 0.00000 | 0.00000 | 0.00002 | 0.00000 | 0.00001 |
| Laboratory Technician | 0.00000 | 0.97773 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00001 | 0.00000 | 0.00001 |
| Manufacturing Director | 0.13262 | 0.00000 | 0.00000 | 1.00000 | 0.45905 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| Healthcare Representative | 0.01607 | 0.00000 | 0.00000 | 0.45905 | 1.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| Manager | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00298 | 0.00000 |
| Sales Representative | 0.00000 | 0.00002 | 0.00001 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 0.00000 | 0.00000 |
| Research Director | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00298 | 0.00000 | 1.00000 | 0.00000 |
| Human Resources | 0.00000 | 0.00001 | 0.00001 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
px.box( df,x="JobRole" ,y="MonthlyIncome")
result =SpecialCalcMEE("JobRole","MonthlyIncome" )
errors = [float(x[3:]) for x in result["Max Error"]]
px.bar(result, x="JobRole", y="Mean", error_y= errors,title="Bar Plot of mean Monthly income with max error of estimation")
CalcMEE("JobRole","MonthlyIncome" )
| JobRole | Total number | Standard Deviation | Mean | Max Error | |
|---|---|---|---|---|---|
| 0 | Sales Executive | 325 | 2338.860456 | 6902.901538 | +/- 255.2325 |
| 1 | Research Scientist | 285 | 1036.962913 | 3146.663158 | +/- 120.9048 |
| 2 | Laboratory Technician | 254 | 1049.060730 | 3168.397638 | +/- 129.6326 |
| 3 | Manufacturing Director | 145 | 2676.745753 | 7295.137931 | +/- 439.3761 |
| 4 | Healthcare Representative | 131 | 2542.550170 | 7528.763359 | +/- 439.4846 |
| 5 | Manager | 94 | 1751.984116 | 17644.212766 | +/- 358.8411 |
| 6 | Sales Representative | 72 | 447.921852 | 2550.986111 | +/- 105.2565 |
| 7 | Research Director | 80 | 2827.621369 | 16033.550000 | +/- 629.2563 |
| 8 | Human Resources | 52 | 2438.849744 | 4235.750000 | +/- 678.9801 |
sp.posthoc_ttest(df, val_col='MonthlyIncome', group_col='Education').apply(lambda x:round(x, 5))
| College | Below College | Master | Bachelor | Doctor | |
|---|---|---|---|---|---|
| College | 1.00000 | 0.18144 | 0.09133 | 0.39796 | 0.00462 |
| Below College | 0.18144 | 1.00000 | 0.00491 | 0.03469 | 0.00057 |
| Master | 0.09133 | 0.00491 | 1.00000 | 0.30994 | 0.04484 |
| Bachelor | 0.39796 | 0.03469 | 0.30994 | 1.00000 | 0.01571 |
| Doctor | 0.00462 | 0.00057 | 0.04484 | 0.01571 | 1.00000 |
px.box( df,x="Education" ,y="MonthlyIncome",points="all")
CalcMEE("Education","MonthlyIncome" )
| Education | Total number | Standard Deviation | Mean | Max Error | |
|---|---|---|---|---|---|
| 0 | College | 251 | 2368.879924 | 4866.195219 | +/- 294.4841 |
| 1 | Below College | 159 | 3113.741827 | 4776.276730 | +/- 487.7209 |
| 2 | Master | 368 | 3336.802687 | 5873.663043 | +/- 342.0498 |
| 3 | Bachelor | 544 | 3966.447131 | 5866.963235 | +/- 334.0561 |
| 4 | Doctor | 48 | 5061.430495 | 8277.645833 | +/- 1469.6862 |
result =SpecialCalcMEE("Education", "MonthlyIncome")
errors = [float(x[3:]) for x in result["Max Error"]]
px.bar(result, x="Education", y="Mean", error_y= errors,title="Bar Plot of mean Monthly income with max error of estimation")
Recap:¶
- Most of the employees in the organization are paid less than 10,000 per month.
- Monthly income plays a significant role in attrition.
- There is a significant difference in monthly income between the different groups of Department, Education, Job Level, and Job Role.
- Employees who quit as a Manager, Manufacturing Director, or Research Director vary greatly in monthly income.
- Active employees in the Human Resources department also vary widely in monthly income.
- Doctors in general vary a lot in their monthly income.
AttOvered =df.groupby(["OverTime",'Attrition'],as_index=False)['Age'].count()
AttOvered.rename(columns={'Age':'Count'},inplace=True)
AttOvered=AddPercentage(AttOvered,"OverTime")
Overed =df.groupby(["OverTime"],as_index=False)['Age'].count()
Overed.rename(columns={'Age':'Count'},inplace=True)
Overed
| OverTime | Count | |
|---|---|---|
| 0 | No | 1054 |
| 1 | Yes | 416 |
AttOvered
| OverTime | Attrition | Count | percent | |
|---|---|---|---|---|
| 0 | No | No | 944 | 89.56 % |
| 1 | No | Yes | 110 | 10.44 % |
| 2 | Yes | No | 289 | 69.47 % |
| 3 | Yes | Yes | 127 | 30.53 % |
plt.pie(Overed["Count"], labels=Overed["OverTime"], autopct='%1.1f%%',explode=[0, 0.1])
plt.show()
px.bar(AttOvered,x="OverTime", y="Count", color="Attrition", text="percent")
fig = plt.figure(figsize=(10, 10))
gs = gridspec.GridSpec(1, 2, width_ratios=[1, 1])
ax1 = plt.subplot(gs[0])
ax2 = plt.subplot(gs[1])
ax1.pie(AttOvered[AttOvered["OverTime"]== "Yes"]["Count"], labels=AttOvered[AttOvered["OverTime"]== "Yes"]["Attrition"], autopct='%1.1f%%',explode=[0, 0.1])
ax2.pie(AttOvered[AttOvered["OverTime"]== "No"]["Count"], labels=AttOvered[AttOvered["OverTime"]== "No"]["Attrition"], autopct='%1.1f%%',explode=[0, 0.1])
# Set titles for the subplots
ax1.set_title("Attrition Percentage with Over Time")
ax2.set_title("Attrition Percentage with no Over Time")
ax2.legend()
# Display the pie charts
plt.show()
ChiSquared("OverTime")
(87.56429365828768, 3.841458820694124, 8.15842372153832e-21)
$X^2 = 87.5643$
$X^2_c = 3.8414$
$Pvalue = 8.1584*{10}^{-21}$
Since $X^2 >> X^2_c$ or $P << 0.05$, we reject the null hypothesis and accept the alternative that says "Overtime has a significant effect on the decision to quit".
ChiSquared("Department","OverTime")
(0.09360659979986957, 5.991464547107979, 0.9542750851354225)
$X^2 = 0.0936$
$X^2_c = 5.9914$
$Pvalue = 0.9543$
Since $X^2 < X^2_c$ or $P > 0.05$ we can conclude that Overtime is independent of departments.
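The `ChiSquared` helper that produces the `(statistic, critical value, p-value)` tuples above is defined in the functions catalog; a minimal sketch of what it appears to do, built on `scipy.stats.chi2_contingency` with an assumed default target of `Attrition`, could be:

```python
import pandas as pd
from scipy import stats

def chi_squared(frame, col, target="Attrition", alpha=0.05):
    """Chi-square test of independence between two categorical columns.
    Returns (statistic, critical value, p-value)."""
    table = pd.crosstab(frame[col], frame[target])
    stat, p, dof, _ = stats.chi2_contingency(table)
    crit = stats.chi2.ppf(1 - alpha, dof)              # X^2_c at the chosen alpha
    return stat, crit, p

# toy data with a strong OverTime/Attrition association
demo = pd.DataFrame({"OverTime":  ["Yes"] * 50 + ["No"] * 50,
                     "Attrition": ["Yes"] * 30 + ["No"] * 20
                                + ["Yes"] * 10 + ["No"] * 40})
stat, crit, p = chi_squared(demo, "OverTime")
print(stat > crit, p < 0.05)   # independence is rejected for this toy data
```

Note that `chi2_contingency` applies Yates' continuity correction by default for 2×2 tables, so the statistic can differ slightly from a hand-computed one.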
px.box(df, x="Department", y="MonthlyIncome", color="OverTime")
CalcMEE("OverTime","MonthlyIncome")
| OverTime | Total number | Standard Deviation | Mean | Max Error | |
|---|---|---|---|---|---|
| 0 | Yes | 388 | 3502.004143 | 5685.396907 | +/- 349.5499 |
| 1 | No | 969 | 3261.547377 | 5441.221878 | +/- 205.6143 |
TTest("MonthlyIncome","OverTime" )
(0.2333121788024749, 1.9615812836543436, 0.8155515298402164)
$1- \alpha = 95\%$
T Value= $0.2333$
Critical value= $1.96$
P_value= $0.8155$
Since $|T| < T_c$ or $P > 0.05$, we can say that there is no significant difference in monthly income between employees who work overtime and those who don't.
px.box(df, x="OverTime", y="MonthlyIncome", points="all")
Recap:¶
- Over 28% of employees in the organization work overtime.
- Over 30% of the employees who left the organization worked overtime.
- Overtime has a significant effect on the decision to quit.
- The overtime option does not depend on department.
- There is no significant difference in monthly income between employees who work overtime and those who don't.
def Countdf(st=""):
    dfToPlot = df.groupby([st, 'Attrition'])["Age"].count().reset_index(name='Counts')
    dfToPlot["Percentage"] = round((dfToPlot["Counts"] / sum(dfToPlot["Counts"])) * 100, 2)
    dfToPlot["Percentage"] = dfToPlot["Percentage"].astype(str).apply(lambda x: x + "%")
    return dfToPlot
def plotIt(dfToPlot, st=""):
    ax = sns.barplot(
        x=st,
        y='Counts',
        hue='Attrition',
        data=dfToPlot)
    for i in ax.containers:
        ax.bar_label(i)
bin_edges = [0, 1, 3, 5, 10]
bin_labels = ['0-1 Companies', '2-3 companies', '4-5 companies', "5+ companies"]
df["NumCompaniesWorkedGroup"] = pd.cut(df['NumCompaniesWorked'], bins=bin_edges, labels=bin_labels)
byNumCompany= Countdf("NumCompaniesWorkedGroup")
plotIt(byNumCompany,"NumCompaniesWorkedGroup")
fig=go.Figure()
byCompany=df.groupby(["NumCompaniesWorkedGroup",'Attrition'],as_index=False)['Age'].count()
byCompany.rename(columns={'Age':'Count'},inplace=True)
fig = make_subplots(rows=1, cols=4,
specs=[[{"type": "pie"}, {"type": "pie"},{"type": "pie"}, {"type": "pie"}]],
subplot_titles=('0-1 Companies', '2-3 companies','4-5 companies','5+ companies'))
fig.add_trace(go.Pie(values=byCompany[byCompany["NumCompaniesWorkedGroup"]=='0-1 Companies']['Count'],labels=byCompany[byCompany["NumCompaniesWorkedGroup"]=='0-1 Companies']['Attrition'],pull=[0,0.1],showlegend=False)
,row=1,col=1)
fig.add_trace(go.Pie(values=byCompany[byCompany["NumCompaniesWorkedGroup"]=='2-3 companies']['Count'],labels=byCompany[byCompany["NumCompaniesWorkedGroup"]=='2-3 companies']['Attrition'],pull=[0,0.1],showlegend=False)
,row=1,col=2)
fig.add_trace(go.Pie(values=byCompany[byCompany["NumCompaniesWorkedGroup"]=='4-5 companies']['Count'],labels=byCompany[byCompany["NumCompaniesWorkedGroup"]=='4-5 companies']['Attrition'],pull=[0,0.1],showlegend=False)
,row=1,col=3)
fig.add_trace(go.Pie(values=byCompany[byCompany["NumCompaniesWorkedGroup"]=='5+ companies']['Count'],labels=byCompany[byCompany["NumCompaniesWorkedGroup"]=='5+ companies']['Attrition'],pull=[0,0.1],showlegend=True)
,row=1,col=4)
fig.update_layout(title_x=0.5,template='simple_white',showlegend=True,
legend_title_text="Attrition",title_text=f"<b style='color:black; font-size:100%;'>Employee Attrition based on number of previous companies",
font_family="Times New Roman",title_font_family="Times New Roman")
T test between employees who quit and those who stayed.
Null hypothesis ($H_0$): the number of previous companies worked for does not affect attrition.¶
Alternative hypothesis ($H_a$): the number of previous companies worked for affects attrition.¶
TTest("NumCompaniesWorked","Attrition" )
(1.6680187953544354, 1.9615812836543436, 0.0955252620565195)
T Value= $1.668$
Critical value= $1.96$
P_value= $0.09552$
Since $|T| < T_c$ or $P > 0.05$, we accept the null hypothesis that the number of previous companies worked for does not affect attrition.
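`TTest` is another catalog helper returning `(T, critical value, p-value)`. A minimal sketch, under the assumption that it is a pooled two-sample t test (the recurring critical value 1.9616 matches a two-sided t at these degrees of freedom), might look like:

```python
import pandas as pd
from scipy import stats

def t_test(frame, num_col, group_col, alpha=0.05):
    """Two-sample t test between the two levels of group_col.
    Returns (t statistic, two-sided critical value, p-value)."""
    levels = frame[group_col].unique()
    a = frame.loc[frame[group_col] == levels[0], num_col]
    b = frame.loc[frame[group_col] == levels[1], num_col]
    t, p = stats.ttest_ind(a, b)                       # pooled-variance t test
    crit = stats.t.ppf(1 - alpha / 2, df=len(a) + len(b) - 2)
    return t, crit, p

# toy data: the "Yes" group is clearly younger
demo = pd.DataFrame({"Attrition": ["Yes"] * 30 + ["No"] * 30,
                     "Age": list(range(25, 55)) + list(range(35, 65))})
t, crit, p = t_test(demo, "Age", "Attrition")
print(t < -crit, p < 0.05)
```

We compare $|T|$ against the critical value because the test is two-sided: a large difference in either direction rejects $H_0$.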
def PlotWithKde(st=""):
    sns.histplot(x=st, hue="Attrition", data=df, kde=True)
    plt.title(f"Employee Distribution by {st}")
    plt.show()
Times =["TotalWorkingYears","YearsAtCompany","YearsSinceLastPromotion","YearsWithCurrManager","YearsInCurrentRole"]
for st in Times:
    PlotWithKde(st)
T_Score = {}
CriticalT = {}
p_values = {}
for col in Times:
    T_Score[col], CriticalT[col], p_values[col] = TTest(col, "Attrition")
columns = list(T_Score.keys())
values = list(T_Score.values())
critical = list(CriticalT.values())
test_df = pd.DataFrame(
{"Features":columns,
"T Value":values,
"Critical Value":critical}
)
test_df["P_value"] = [format(p, '.20f') for p in list(p_values.values())]
test_df["P_value"] = test_df["P_value"].astype(float)
test_df["Result"] = test_df["P_value"].map(lambda x:"Accept" if x > 0.05 else "Reject")
test_df
| Features | T Value | Critical Value | P_value | Result | |
|---|---|---|---|---|---|
| 0 | TotalWorkingYears | -6.652255 | 1.961581 | 4.061878e-11 | Reject |
| 1 | YearsAtCompany | -5.196309 | 1.961581 | 2.318872e-07 | Reject |
| 2 | YearsSinceLastPromotion | -1.265788 | 1.961581 | 2.057900e-01 | Accept |
| 3 | YearsWithCurrManager | -6.059069 | 1.961581 | 1.736987e-09 | Reject |
| 4 | YearsInCurrentRole | -6.232038 | 1.961581 | 6.003186e-10 | Reject |
Recap:¶
- Most employees have a total of 5 to 10 years of working experience.
- The attrition rate is high among employees with less working experience.
- Most employees have been promoted recently.
- Most employees have worked 2 to 5 or 7 to 10 years in the same role in the organization.
- Very few employees have worked less than 1 year or more than 10 years in the same role.
- The number of previous companies worked for and the years since last promotion do not affect attrition.
- Total working years, years at the company, years in the current role, and years with the current manager all affect the decision to quit.
SatisfactionList = ['EnvironmentSatisfaction', 'JobSatisfaction','RelationshipSatisfaction']
fig=go.Figure()
fig = make_subplots(rows=1, cols=3,
specs=[[{"type": "pie"}, {"type": "pie"}, {"type": "pie"}]],
subplot_titles=('Environment Satisfaction', 'Job Satisfaction','Relationship Satisfaction'))
EnvironmenSat=df.groupby(["EnvironmentSatisfaction"],as_index=False)["Age"].count()
jobSat=df.groupby(["JobSatisfaction"],as_index=False)["Age"].count()
RelationshipSat=df.groupby(["RelationshipSatisfaction"],as_index=False)["Age"].count()
fig.add_trace(go.Pie(
values=EnvironmenSat["Age"],
labels=EnvironmenSat["EnvironmentSatisfaction"],
hole=0.5,
name='Environment Satisfaction',
showlegend=False)
,row=1,col=1)
fig.add_trace(go.Pie(
values=jobSat["Age"],
labels=jobSat["JobSatisfaction"],
hole=0.5,
name='Job Satisfaction',
showlegend=False)
,row=1,col=2)
fig.add_trace(go.Pie(
values=RelationshipSat["Age"],
labels=RelationshipSat["RelationshipSatisfaction"],
hole=0.5,
name='Relationship Satisfaction',
showlegend=True)
,row=1,col=3)
fig.update_layout(title_x=0.5,template='simple_white',showlegend=True,
legend_title_text="Satisfaction",
title_text='<b style="color:black; font-size:100%;">Employee Satisfaction Analysis',
font_family="Times New Roman",title_font_family="Times New Roman")
for st in SatisfactionList:
    # print(st)
    # print(ChiSquared(st))
    ChiSquared(st)
| Satisfaction | ${X^2}$ | ${X^2}_c$ | $Pvalue$ | ||
|---|---|---|---|---|---|
| EnvironmentSatisfaction | 22.5038 | 7.8147 | 5.1234e-05 | ||
| JobSatisfaction | 17.5051 | 7.8147 | 0.0005 | ||
| RelationshipSatisfaction | 5.2411 | 7.8147 | 0.1549 |
for st in CatList[:3]:
PlotPies(st)
for st in SatisfactionList:
    # print(st)
    # print(SingleANOVA(st,"MonthlyIncome"))
    SingleANOVA(st, "MonthlyIncome")
| Satisfaction | $F$ | $F_c$ | $Pvalue$ | ||
|---|---|---|---|---|---|
| EnvironmentSatisfaction | 0.4100 | 2.6109 | 0.7458 | ||
| JobSatisfaction | 0.0270 | 2.6109 | 0.9940 | ||
| RelationshipSatisfaction | 0.5529 | 2.6109 | 0.6462 |
for st in SatisfactionList:
    # print(st)
    # print(SpecialSingleANOVA(st,"MonthlyIncome"))
    SpecialSingleANOVA(st, "MonthlyIncome")
| Satisfaction | $F$ | $F_c$ | $Pvalue$ | ||
|---|---|---|---|---|---|
| EnvironmentSatisfaction | 6.2719 | 2.0158 | 2.8786 $*{10}^{-7}$ | ||
| JobSatisfaction | 5.6935 | 2.0158 | 1.6477 $*{10}^{-6}$ | ||
| RelationshipSatisfaction | 6.5153 | 2.0158 | 1.3754 $*{10}^{-7}$ |
The level of satisfaction does not depend on monthly income, but an ANOVA test between the groups of employees who quit or stayed at each level of satisfaction shows that those groups differ a lot in monthly income, which emphasizes again that monthly income has a large effect on the decision to quit.
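`SingleANOVA` and `SpecialSingleANOVA` are defined in the functions catalog; judging by the two tables above, the "special" variant appears to treat each (satisfaction level, Attrition) pair as its own ANOVA group. That reading is an assumption about an implementation not shown here; a hedged sketch of the idea with `scipy.stats.f_oneway`:

```python
import pandas as pd
from scipy import stats

def special_single_anova(frame, cat_col, num_col, alpha=0.05):
    """One-way ANOVA where each (category level, Attrition) pair is a group.
    Returns (F statistic, critical value, p-value)."""
    groups = [g[num_col].values
              for _, g in frame.groupby([cat_col, "Attrition"])
              if len(g) > 1]
    f, p = stats.f_oneway(*groups)
    n = sum(len(g) for g in groups)
    crit = stats.f.ppf(1 - alpha, dfn=len(groups) - 1, dfd=n - len(groups))
    return f, crit, p

# toy data: quitters earn far less at every satisfaction level
demo = pd.DataFrame({
    "JobSatisfaction": [1, 1, 1, 1, 2, 2, 2, 2] * 2,
    "Attrition":       ["Yes"] * 8 + ["No"] * 8,
    "MonthlyIncome":   [3000, 3100, 2900, 3050, 3200, 3150, 3000, 3100,
                        6000, 6100, 5900, 6050, 6200, 6150, 6000, 6100]})
f, crit, p = special_single_anova(demo, "JobSatisfaction", "MonthlyIncome")
print(f > crit, p < 0.05)
```

This reproduces the pattern in the tables: grouping by satisfaction alone shows no income difference, while splitting further by attrition makes the income gap significant.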
for st in SatisfactionList:
    fig = px.box(df, x=st, y="MonthlyIncome", color="Attrition")
    fig.show()
Recap:¶
- Most employees have a high or very high level of satisfaction.
- Environment and job satisfaction have an effect on attrition, but relationship satisfaction does not.
- The lower the level of satisfaction, the more employees quit.
- The level of satisfaction does not depend on monthly income.
- Monthly income is significantly different between the groups who left and those who stayed at each level of satisfaction.
sns.kdeplot(x=df['Age'],color='blue',fill=True,label='Age')
plt.axvline(x=df['Age'].mean(),color='red',linestyle ="--",label='Mean Age: 36.923')
plt.legend()
plt.title('Distribution of Age')
plt.show()
def ShowHist():
    fig, axes = plt.subplots(1, 3, sharex=True, figsize=(15, 5))
    fig.suptitle('Attrition Age Distribution by Gender')
    sns.histplot(ax=axes[0], x="Age", hue="Attrition", data=df, kde=True, palette=["r", "b"])
    axes[0].set_title('Overall')
    sns.histplot(ax=axes[1], x="Age", hue="Attrition", data=df[df["Gender"] == "Male"].drop([1]), kde=True, palette=["r", "b"])
    axes[1].set_title('Male')
    sns.histplot(ax=axes[2], x="Age", hue="Attrition", data=df[df["Gender"] == "Female"], kde=True, palette=["r", "b"])
    axes[2].set_title('Female')
    plt.show()
def ShowBox():
    fig, axes = plt.subplots(1, 3, sharex=True, figsize=(15, 5))
    fig.suptitle('Attrition Age Distribution by Gender')
    sns.boxplot(ax=axes[0], y="Age", x="Attrition", data=df)
    sns.boxplot(ax=axes[1], y="Age", x="Attrition", data=df[df["Gender"] == "Male"].drop([1]))
    sns.boxplot(ax=axes[2], y="Age", x="Attrition", data=df[df["Gender"] == "Female"])
    plt.show()
TTest("Age", "Attrition")
(-6.178663835307217, 1.9615812836543436, 8.356308021103587e-10)
T Value= $-6.1786$
Critical value= $1.961$
P_value= $8.3563 * {10}^{-10}$
Since $|T| > T_c$ or $P << 0.05$, we can say that there is a significant difference in age between employees who quit and those who stayed, which indicates that age has an effect on attrition.
TTest("Age", "Gender")
(1.3921381802920636, 1.9615812836543436, 0.16409141231818586)
SpecialSingleANOVA("Gender","Age")
(14.58158957406234, 2.3213794446188195e-09, 2.6109723453486713)
Overall Test ($T$ test):
$T = 1.3921$
$T_c = 1.96$
$Pvalue_{T} = 0.1641$

When taking attrition into consideration (ANOVA):
$F = 14.5816$
$F_c = 2.6109$
$Pvalue_F = 2.3214*{10}^{-9}$
Since $|T| < T_c$ or $P_{T} > 0.05$, we conclude that there is no difference in age between males and females in general; but when we split the groups by both gender and attrition, there is a significant difference in age between those groups, since $F > F_c$ and $P_F << 0.05$.
px.box(df, x="Gender", y="Age")
ShowHist()
ShowBox()
c= stats.pearsonr(df["MonthlyIncome"], df["Age"])
c
PearsonRResult(statistic=0.49785456692658037, pvalue=6.669539203000345e-93)
slope, intercept, r_value, p_value, std_err = stats.linregress(df['Age'], df['MonthlyIncome'])
line = slope * df['Age'] + intercept
X =np.array(df["Age"])
y=df["MonthlyIncome"]
X_with_intercept = sm.add_constant(X)
SMmodel = sm.OLS(y, X_with_intercept).fit()
# Get predictions and prediction intervals
predictions = SMmodel.get_prediction(X_with_intercept)
pi = predictions.conf_int(obs=True)
max_error =(pi[:,1] - pi[:,0])/2 #(upper - lower)/2 0.95 CI
max_error_df = pd.DataFrame({
'Age': df['Age'],
'Upper_bound': line + max_error,
'Lower_bound': line - max_error
})
fig = px.scatter(df, x='Age', y='MonthlyIncome',trendline='ols',title='Relationship between Age and Monthly Income with Prediction Interval CI = 95%')
fig.update_traces(line=dict(color="rgba(255, 0, 0,0.05)", width=180))
fig.add_trace(px.line(max_error_df, x='Age', y='Upper_bound',line_shape='linear').data[0])
fig.add_trace(px.line(max_error_df, x='Age', y='Lower_bound',line_shape='linear').data[0])
fig.add_trace(px.line( x=df['Age'], y=line,line_shape='linear').data[0])
fig.show()
The correlation statistic $r = 0.5$ indicates a moderate positive correlation between monthly income and age.¶
fig = px.scatter(df, x='Age', y='MonthlyIncome',color="Attrition", trendline='ols'
, opacity=0.5,
title='Attrition between Age and Monthly Income')
# Show the plot
fig.show()
Recap:¶
- The age distribution is slightly right-skewed, with the bulk of the staff between 25 and 45 years old.
- There is a significant difference in age between employees who quit and those who stayed, which indicates that age has an effect on attrition.
- There is no difference in age between males and females in general, but when the groups are split by both gender and attrition there is a significant difference in age between those groups.
- Employees aged 25 to 35 are more likely to quit.
- There is a moderate positive correlation between monthly income and age.
- Employees older than 25 who quit were paid less than others of the same age.
ChiSquared("Gender", "Attrition")
(1.1169671241970975, 3.841458820694124, 0.29057244902890855)
$X^2 = 1.116$
$X^2_c = 3.841$
$Pvalue = 0.2905$
Since $X^2 < X^2_c$ or $P > 0.05$, we accept the null hypothesis that gender has no effect on employee attrition.
fig = plt.figure(figsize=(10, 5))
gs = gridspec.GridSpec(1, 2, width_ratios=[1, 1])
Gender =df.groupby(["Gender",'Attrition'],as_index=False)['Age'].count()
Gender.rename(columns={'Age':'Counts'},inplace=True)
ax0 = plt.subplot(gs[0])
ax1 = plt.subplot(gs[1])
ax0.pie(Gender[Gender["Gender"]== "Male"]["Counts"], labels=Gender[Gender["Gender"]== "Male"]["Attrition"], autopct='%1.1f%%',explode=[0, 0.1])
ax1.pie(Gender[Gender["Gender"]== "Female"]["Counts"], labels=Gender[Gender["Gender"]== "Female"]["Attrition"], autopct='%1.1f%%',explode=[0, 0.1])
ax0.set_title("Attrition Percentage for Males")
ax1.set_title("Attrition Percentage for Females")
plt.show()
byEduLevel = df.groupby(['Education', 'Attrition'])["Age"].count().reset_index(name='Counts')
byEduLevel = AddPercentage(byEduLevel,"Education")
EduOrder = ["Below College","College","Bachelor","Master","Doctor"]
px.bar(byEduLevel, y ="Counts", x ="Education",color="Attrition", text='percent',category_orders= {'Education':EduOrder})
All education levels have roughly the same attrition probability, but doctors are the least likely to quit.
byEduField = df.groupby(['EducationField', 'Attrition'])["Age"].count().reset_index(name='Counts')
byEduField = AddPercentage(byEduField,"EducationField")
px.bar(byEduField, y ="Counts", x ="EducationField",color="Attrition", text='percent')
_,_ = ANOVA("Education")
ChiSquared("Education")
(3.0739613982367193, 9.487729036781154, 0.5455253376565949)
$X^2 = 3.0739$
$X^2_c = 9.4877$
$Pvalue = 0.5455$
Since $X^2 < X^2_c$ or $P > 0.05$, we accept the null hypothesis that education has no effect on employee attrition.
ChiSquared("EducationField")
(16.024674119585427, 11.070497693516351, 0.006773980139025212)
$X^2 = 16.0246$
$X^2_c = 11.0704$
$Pvalue = 0.0067$
Since $X^2 > X^2_c$ or $P < 0.05$, we reject the null hypothesis and conclude that education field has a significant effect on employee attrition.
Recap:¶
- Most employees in the organization hold a Bachelor's or Master's degree, and few hold a Doctorate.
- Doctors are less likely to quit.
- Most employees come from a Life Sciences or Medical education field, and very few come from a Human Resources education field.
- Education fields such as Human Resources, Marketing, and Technical Degree have very high attrition rates.
- Employees from different educational backgrounds vary in their age and monthly income.
- Education field has a significant effect on employee attrition, but education level does not.
ChiSquared("MaritalStatus")
(46.163676540848705, 5.991464547107979, 9.45551106034083e-11)
$X^2 = 46.163$
$X^2_c = 5.991$
$Pvalue = 9.4555 *{10}^{-11}$
Since $X^2 >> X^2_c$ or $P << 0.05$, we reject the null hypothesis and conclude that marital status affects employee attrition.
bus=df.groupby(["MaritalStatus",'Attrition'],as_index=False)['Age'].count()
bus.rename(columns={'Age':'Count'},inplace=True)
fig=go.Figure()
fig = make_subplots(rows=1, cols=3,
specs=[[{"type": "pie"}, {"type": "pie"},{"type": "pie"}]],
subplot_titles=('Divorced', 'Married','Single'))
fig.add_trace(go.Pie(values=bus[bus["MaritalStatus"]=='Divorced']['Count'],labels=bus[bus["MaritalStatus"]=='Divorced']['Attrition'],pull=[0,0.1],showlegend=False)
,row=1,col=1)
fig.add_trace(go.Pie(values=bus[bus["MaritalStatus"]=='Married']['Count'],labels=bus[bus["MaritalStatus"]=='Married']['Attrition'],pull=[0,0.1],showlegend=False)
,row=1,col=2)
fig.add_trace(go.Pie(values=bus[bus["MaritalStatus"]=='Single']['Count'],labels=bus[bus["MaritalStatus"]=='Single']['Attrition'],pull=[0,0.1],showlegend=True)
,row=1,col=3)
fig.update_layout(title_x=0.5,template='simple_white',showlegend=True,
legend_title_text="Attrition",title_text=f"<b style='color:black; font-size:100%;'>Employee Attrition based on Marital Status",
font_family="Times New Roman",title_font_family="Times New Roman")
fig.show()
px.box(df, y="DistanceFromHome",x="Attrition", color="Attrition", points="all" )
plt.figure(figsize=(13.5,6))
plt.subplot(1,2,1)
sns.histplot(x="DistanceFromHome",hue="Attrition",data=df,kde=True)
plt.title("Employee Distribution by Distance From Home & Attrition")
plt.subplot(1,2,2)
sns.boxplot(x="Attrition",y="DistanceFromHome",data=df)
plt.title("Employee Distribution by Distance From Home & Attrition")
plt.show()
TTest("DistanceFromHome", "Attrition")
(2.994708098265125, 1.9615812836543436, 0.0027930600802134266)
$T = 2.9947$
$T_c = 1.96$
$Pvalue = 0.0028$
Since $T > T_c$ or $P < 0.05$, we can conclude that distance from home plays a role in turnover.
WL_Balance=df.groupby(["WorkLifeBalance"],as_index=False)["Age"].count()
fig =px.pie(
values=WL_Balance["Age"],
names=WL_Balance["WorkLifeBalance"],title="Work life balance rate in the organization",
hole=0.5)
fig.show()
PlotPies("WorkLifeBalance")
ChiSquared("WorkLifeBalance")
(16.3250970916474, 7.814727903251179, 0.0009725698845348824)
$X^2 = 16.3251$
$X^2_c = 7.8147$
$Pvalue = 0.0009$
Since $X^2 > X^2_c$ or $P < 0.05$, we reject the null hypothesis and conclude that work-life balance has an effect on employee attrition.
Recap:¶
- Single employees are more likely to quit.
- Marital status affects employee attrition.
- Most employees live within 0 to 10 distance units of the company.
- Distance from home plays a role in turnover.
- Most employees have a high or medium work-life balance.
- Employees with a low work-life balance are more likely to quit.
- Work-life balance has an effect on employee attrition.
corr = df.select_dtypes("number").corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(15, 10))
sns.heatmap(corr,
vmax=1,
mask=mask,
annot=True, fmt='.2f',
linewidths=.2, cmap=sns.color_palette("Reds"))
<Axes: >
Recap:¶
- Features related to years of service are highly correlated with each other.
- Age is moderately correlated with monthly income but highly correlated with total working years.
- The higher the total working years, the higher the monthly income.
We will use this analysis when training a model to drop some features.¶
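One way to act on the heatmap when preparing model features is to list the numeric pairs whose absolute correlation crosses a threshold; a small sketch (the 0.7 cutoff and the helper name are assumptions for illustration, not values used elsewhere in this notebook):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.7):
    """Return (feature_a, feature_b, corr) for numeric pairs above the threshold."""
    corr = frame.select_dtypes("number").corr().abs()
    # keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(float(upper.loc[a, b]), 2))
            for a in upper.index for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > threshold]

rng = np.random.default_rng(1)
years = rng.uniform(0, 40, 200)
demo = pd.DataFrame({"TotalWorkingYears": years,
                     "YearsAtCompany": years * 0.8 + rng.normal(0, 1, 200),
                     "DailyRate": rng.uniform(100, 1500, 200)})
print(high_corr_pairs(demo))   # the two tenure columns show up as a pair
```

Dropping one member of each such pair keeps the model from receiving near-duplicate information.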
CorrScore = {}
p_values = {}
for col in df.select_dtypes("number").columns:
    cc, p = stats.pointbiserialr(df["Attrition"].replace({"Yes": 1, "No": 0}), df[col])
    CorrScore[col] = cc
    p_values[col] = p
columns = list(CorrScore.keys())
values = list(CorrScore.values())
test_df = pd.DataFrame(
{"Features":columns,
"Corr":values}
)
test_df["P_value"] = [format(p, '.20f') for p in list(p_values.values())]
test_df["P_value"] = test_df["P_value"].astype(float)
test_df["Result"] = test_df["P_value"].map(lambda x:"Accept" if x > 0.05 else "Reject")
test_df
| Features | Corr | P_value | Result | |
|---|---|---|---|---|
| 0 | Age | -0.159205 | 8.356308e-10 | Reject |
| 1 | DailyRate | -0.056652 | 2.985816e-02 | Reject |
| 2 | DistanceFromHome | 0.077924 | 2.793060e-03 | Reject |
| 3 | HourlyRate | -0.006846 | 7.931348e-01 | Accept |
| 4 | JobInvolvement | -0.130016 | 5.677065e-07 | Reject |
| 5 | MonthlyIncome | -0.159840 | 7.147364e-10 | Reject |
| 6 | MonthlyRate | 0.015170 | 5.611236e-01 | Accept |
| 7 | NumCompaniesWorked | 0.043494 | 9.552526e-02 | Accept |
| 8 | PercentSalaryHike | -0.013478 | 6.056128e-01 | Accept |
| 9 | TotalWorkingYears | -0.171063 | 4.061878e-11 | Reject |
| 10 | TrainingTimesLastYear | -0.059478 | 2.257850e-02 | Reject |
| 11 | YearsAtCompany | -0.134392 | 2.318872e-07 | Reject |
| 12 | YearsInCurrentRole | -0.160545 | 6.003186e-10 | Reject |
| 13 | YearsSinceLastPromotion | -0.033019 | 2.057900e-01 | Accept |
| 14 | YearsWithCurrManager | -0.156199 | 1.736987e-09 | Reject |
test_df[test_df["P_value"]<0.05]["Features"]
0                       Age
1                 DailyRate
2          DistanceFromHome
4            JobInvolvement
5             MonthlyIncome
9         TotalWorkingYears
10    TrainingTimesLastYear
11           YearsAtCompany
12       YearsInCurrentRole
14     YearsWithCurrManager
Name: Features, dtype: object
Reject $H_0$ for these features, as they have a significant effect on Attrition¶
[ "Age", "DailyRate", "DistanceFromHome", "JobInvolvement", "MonthlyIncome", "TotalWorkingYears", "TrainingTimesLastYear", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager" ]
Note: these are the same results as the t test.
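The note above holds in general: the point-biserial correlation test and the pooled two-sample t test are mathematically the same test, so their p-values coincide. A quick check on synthetic data (a stand-in binary flag, not the HR data itself):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=200)             # stand-in binary Attrition flag
value = rng.normal(loc=group * 0.5, scale=1.0)   # numeric feature shifted by group

r, p_pb = stats.pointbiserialr(group, value)
t, p_t = stats.ttest_ind(value[group == 1], value[group == 0])
print(np.isclose(p_pb, p_t))                     # identical p-values
```

This is why the Accept/Reject column here exactly matches the t-test table above it.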
PlotErrors(p_values)
plt.title("P values results of correlation analysis between numerical features with Attrition")
Text(0.5, 1.0, 'P values results of correlation analysis between numerical features with Attrition')
df.select_dtypes("number").corrwith(df["Attrition"].replace({"Yes":1,"No": 0})).sort_values().plot(kind='barh')
<Axes: >
test_df, p= ApplyTTest("Attrition")
test_df
| Features | T Value | Critical Value | P_value | Result | |
|---|---|---|---|---|---|
| 0 | Age | -6.178664 | 1.961581 | 8.356308e-10 | Reject |
| 1 | DailyRate | -2.174084 | 1.961581 | 2.985816e-02 | Reject |
| 2 | DistanceFromHome | 2.994708 | 1.961581 | 2.793060e-03 | Reject |
| 3 | HourlyRate | -0.262290 | 1.961581 | 7.931348e-01 | Accept |
| 4 | JobInvolvement | -5.024140 | 1.961581 | 5.677065e-07 | Reject |
| 5 | MonthlyIncome | -6.203936 | 1.961581 | 7.147364e-10 | Reject |
| 6 | MonthlyRate | 0.581306 | 1.961581 | 5.611236e-01 | Accept |
| 7 | NumCompaniesWorked | 1.668019 | 1.961581 | 9.552526e-02 | Accept |
| 8 | PercentSalaryHike | -0.516457 | 1.961581 | 6.056128e-01 | Accept |
| 9 | TotalWorkingYears | -6.652255 | 1.961581 | 4.061878e-11 | Reject |
| 10 | TrainingTimesLastYear | -2.282903 | 1.961581 | 2.257850e-02 | Reject |
| 11 | YearsAtCompany | -5.196309 | 1.961581 | 2.318872e-07 | Reject |
| 12 | YearsInCurrentRole | -6.232038 | 1.961581 | 6.003186e-10 | Reject |
| 13 | YearsSinceLastPromotion | -1.265788 | 1.961581 | 2.057900e-01 | Accept |
| 14 | YearsWithCurrManager | -6.059069 | 1.961581 | 1.736987e-09 | Reject |
plotP(p,"")
test_df[test_df["P_value"] > 0.05]["Features"]
3                  HourlyRate
6                 MonthlyRate
7          NumCompaniesWorked
8           PercentSalaryHike
13    YearsSinceLastPromotion
Name: Features, dtype: object
Reject $H_0$ for these features, as there is a significant difference in them between employees who left and those who stayed¶
[ "Age", "DailyRate", "DistanceFromHome", "JobInvolvement", "MonthlyIncome", "TotalWorkingYears", "TrainingTimesLastYear", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager" ]
chi2_statistic = {}
p_values = {}
criticalChis = {}
# Perform chi-square test for each column
for col in df.select_dtypes("object").drop("Attrition", axis=1).columns:
    chi2, crit, p_value = ChiSquared(col)
    chi2_statistic[col] = chi2
    p_values[col] = p_value
    criticalChis[col] = crit
columns = list(chi2_statistic.keys())
values = list(chi2_statistic.values())
critical = list(criticalChis.values())
test_df = pd.DataFrame(
{"Features":columns,
"Chi_2 Statistic":values,
"Critical Value":critical}
)
test_df["P_value"] = [format(p, '.20f') for p in list(p_values.values())]
test_df["P_value"] = test_df["P_value"].astype(float)
test_df["Result"] = test_df["P_value"].map(lambda x:"Accept" if x > 0.05 else "Reject")
test_df
| Features | Chi_2 Statistic | Critical Value | P_value | Result | |
|---|---|---|---|---|---|
| 0 | BusinessTravel | 24.182414 | 5.991465 | 5.608614e-06 | Reject |
| 1 | Department | 10.796007 | 5.991465 | 4.525607e-03 | Reject |
| 2 | Education | 3.073961 | 9.487729 | 5.455253e-01 | Accept |
| 3 | EducationField | 16.024674 | 11.070498 | 6.773980e-03 | Reject |
| 4 | EnvironmentSatisfaction | 22.503881 | 7.814728 | 5.123469e-05 | Reject |
| 5 | Gender | 1.116967 | 3.841459 | 2.905724e-01 | Accept |
| 6 | JobLevel | 72.529013 | 9.487729 | 6.634680e-15 | Reject |
| 7 | JobRole | 86.190254 | 15.507313 | 2.752480e-15 | Reject |
| 8 | JobSatisfaction | 17.505077 | 7.814728 | 5.563005e-04 | Reject |
| 9 | MaritalStatus | 46.163677 | 5.991465 | 9.455511e-11 | Reject |
| 10 | OverTime | 87.564294 | 3.841459 | 1.000000e-20 | Reject |
| 11 | PerformanceRating | 0.000155 | 3.841459 | 9.900745e-01 | Accept |
| 12 | RelationshipSatisfaction | 5.241068 | 7.814728 | 1.549724e-01 | Accept |
| 13 | StockOptionLevel | 60.598301 | 7.814728 | 4.379390e-13 | Reject |
| 14 | WorkLifeBalance | 16.325097 | 7.814728 | 9.725699e-04 | Reject |
set(test_df[test_df["P_value"]<0.05]["Features"])
{'BusinessTravel',
'Department',
'EducationField',
'EnvironmentSatisfaction',
'JobLevel',
'JobRole',
'JobSatisfaction',
'MaritalStatus',
'OverTime',
'StockOptionLevel',
'WorkLifeBalance'}
Reject $H_0$ for these features, as they have a significant effect on Attrition¶
[ "BusinessTravel", "Department", "EducationField", "EnvironmentSatisfaction", "JobLevel", "JobRole", "JobSatisfaction", "MaritalStatus", "OverTime", "StockOptionLevel", "WorkLifeBalance" ]
PlotChi(columns,values)
A lower $\chi^2$ indicates no relation (the accept region), but we cannot be certain of that until we compare it with the critical value $\chi^2_c$.
plotP(p_values,"",True)
PlotErrors(p_values,ISchi = True)
def evaluate(model, X_train, X_test, y_train, y_test):
    """Print confusion matrix, accuracy, precision, recall, and AUC for both splits."""
    y_test_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    acc_train = accuracy_score(y_train, y_train_pred)
    acc_test = accuracy_score(y_test, y_test_pred)
    testconfusion = confusion_matrix(y_test, y_test_pred)
    trainconfusion = confusion_matrix(y_train, y_train_pred)
    print("TRAINING RESULTS: \n===============================")
    print(f"CONFUSION MATRIX:\n{trainconfusion}")
    print(f"ACCURACY SCORE:\n{acc_train:.4f}")
    print("precision score:", round(precision_score(y_train, y_train_pred), 2))
    print("Recall Accuracy:", round(recall_score(y_train, y_train_pred), 2))
    print("Area Under Curve AUC:", round(roc_auc_score(y_train, y_train_pred), 2))
    print("\n\nTESTING RESULTS: \n===============================")
    print(f"CONFUSION MATRIX:\n{testconfusion}")
    print(f"ACCURACY SCORE:\n{acc_test:.4f}")
    print("precision score:", round(precision_score(y_test, y_test_pred), 2))
    print("Recall Accuracy:", round(recall_score(y_test, y_test_pred), 2))
    print("Area Under Curve AUC:", round(roc_auc_score(y_test, y_test_pred), 2))
data =df.copy()
data.drop(columns="NumCompaniesWorkedGroup", inplace =True)
data["Attrition"]= data["Attrition"].replace({"Yes":1,
"No": 0})
from category_encoders import OneHotEncoder
X = data.drop(columns=['Attrition','HourlyRate','MonthlyRate',
'NumCompaniesWorked','PercentSalaryHike','YearsSinceLastPromotion',
'JobInvolvement','Education','Gender','YearsAtCompany','PerformanceRating','YearsWithCurrManager'], axis=1)
y = data.Attrition
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42,
stratify=y)# because the data is unbalanced
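The `stratify=y` argument matters here because only about 16% of the rows are attrition cases: stratifying keeps that class ratio identical in both splits. A small illustration on a synthetic 20%-positive target (not the HR data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.Series([1] * 20 + [0] * 80)               # imbalanced target: 20% positives
X = pd.DataFrame({"feature": np.arange(100)})

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42, stratify=y)
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))  # both splits keep ~20% positives
```

Without stratification, a random 30% test split could by chance receive far fewer positives, distorting recall and AUC on the minority class.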
model = make_pipeline(
OneHotEncoder(use_cat_names=True),
StandardScaler(),
LogisticRegression(max_iter=1000)
)
model.fit(X_train,y_train)
Pipeline(steps=[('onehotencoder',
OneHotEncoder(cols=['BusinessTravel', 'Department',
'EducationField',
'EnvironmentSatisfaction', 'JobLevel',
'JobRole', 'JobSatisfaction',
'MaritalStatus', 'OverTime',
'RelationshipSatisfaction',
'StockOptionLevel', 'WorkLifeBalance'],
use_cat_names=True)),
('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression(max_iter=1000))])
features = model.named_steps["onehotencoder"].get_feature_names_out()
importances = model.named_steps["logisticregression"].coef_[0]
evaluate(model, X_train, X_test, y_train, y_test)
TRAINING RESULTS:
===============================
CONFUSION MATRIX:
[[843  20]
 [ 83  83]]
ACCURACY SCORE:
0.8999
precision score: 0.81
Recall Accuracy: 0.5
Area Under Curve AUC: 0.74


TESTING RESULTS:
===============================
CONFUSION MATRIX:
[[357  13]
 [ 42  29]]
ACCURACY SCORE:
0.8753
precision score: 0.69
Recall Accuracy: 0.41
Area Under Curve AUC: 0.69
ConfusionMatrixDisplay.from_estimator(model, X_train, y_train)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x181699ee250>
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x18169c88490>
numOfSamples = 50
Accuracies = list()
for i in range(numOfSamples):
    BootX_train, Booty_train = resample(X_train, y_train)   # bootstrap resample of the training set
    model.fit(BootX_train, Booty_train)
    y_test_pred = model.predict(X_test)
    score = accuracy_score(y_test, y_test_pred)
    # print("TESTING RESULTS: \n===============================")
    # print(f"CONFUSION MATRIX:\n{confusion_matrix(y_test, y_test_pred)}")
    # print(f"ACCURACY SCORE:\n{score:.4f}")
    Accuracies.append(score)
plt.errorbar(x = np.arange(1, len(Accuracies) +1 ,1), y= Accuracies, fmt='o')
plt.axhline(y=np.mean(Accuracies),color='red',linestyle ="--", label=f"Mean Accuracy = {round(np.mean(Accuracies),4)}")
plt.legend()
<matplotlib.legend.Legend at 0x1816a6eb810>
sns.histplot(Accuracies, kde= True)
<Axes: ylabel='Count'>
# confidence intervals
alpha = 0.95
p = ((1.0-alpha)/2.0) * 100
lower = max(0.0, np.percentile(Accuracies, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(Accuracies, p))
plt.figure(figsize=(12, 6))
x = np.arange(1, len(Accuracies) + 1)
y = Accuracies
y_error = (upper - lower) / 2.0
plt.errorbar(x, y,
             yerr=y_error,
             fmt='o', capsize=3)
plt.xticks(rotation=90)
plt.axhline(y=np.mean(Accuracies), color='red')
<matplotlib.lines.Line2D at 0x18169ddc510>
print('At the %.1f%% confidence level, accuracy = %.1f%% +/- %.1f%%' % (alpha*100, np.mean(Accuracies)*100, (upper - lower)/2.0*100))
At the 95.0% confidence level, accuracy = 86.0% +/- 1.7%
print('At the %.1f%% confidence level, accuracy lies between %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
At the 95.0% confidence level, accuracy lies between 84.1% and 87.5%
Results = df.copy()
Results["AttritionLikelihood"] = model.predict_proba(X)[:, 1]
sns.histplot(Results["AttritionLikelihood"], kde=True)
<Axes: xlabel='AttritionLikelihood', ylabel='Count'>
Results["RiskLevel"] = Results["AttritionLikelihood"].map(lambda x: "Strong" if x > 0.4 else "Medium" if x > 0.3 else "Weak")
Results[['Attrition', 'AttritionLikelihood', 'RiskLevel']][42:53]
| | Attrition | AttritionLikelihood | RiskLevel |
|---|---|---|---|
| 42 | Yes | 0.811160 | Strong |
| 43 | No | 0.288771 | Weak |
| 44 | No | 0.024744 | Weak |
| 45 | Yes | 0.000037 | Weak |
| 46 | No | 0.012220 | Weak |
| 47 | No | 0.208084 | Weak |
| 48 | No | 0.213677 | Weak |
| 49 | No | 0.069798 | Weak |
| 50 | Yes | 0.784926 | Strong |
| 51 | Yes | 0.762197 | Strong |
| 52 | No | 0.095868 | Weak |
pd.crosstab(Results["RiskLevel"], Results["Attrition"])
| RiskLevel | Attrition = No | Attrition = Yes |
|---|---|---|
| Medium | 39 | 21 |
| Strong | 71 | 132 |
| Weak | 1123 | 84 |
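Reading the crosstab: of the employees who actually left, how many does the "Strong" bucket catch, and how often is a "Strong" flag right? The counts below are copied from the table above:

```python
# Counts from the RiskLevel x Attrition crosstab above
leavers = {"Strong": 132, "Medium": 21, "Weak": 84}
stayers_strong = 71

total_leavers = sum(leavers.values())             # 237 employees left overall
capture_rate = leavers["Strong"] / total_leavers  # share of leavers flagged Strong
precision_strong = leavers["Strong"] / (leavers["Strong"] + stayers_strong)
print(total_leavers, round(capture_rate, 3), round(precision_strong, 3))
# 237 0.557 0.65
```

So the "Strong" bucket catches a little over half of the actual leavers, and roughly two out of three "Strong" flags correspond to an employee who did leave.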
feat_imp = pd.Series(importances, index=features).sort_values()
SortedDF = pd.DataFrame(feat_imp)
MaxMin = pd.concat([SortedDF.head(7), SortedDF.tail(7)], axis=0)
MaxMin.plot(kind="barh")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("Feature Importance")
Text(0.5, 1.0, 'Feature Importance')
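Because the features are standardized before the logistic regression, the plotted coefficients are comparable across features, and each has a direct reading: a one-standard-deviation increase in a feature multiplies the odds of attrition by exp(coefficient). A quick illustration with a made-up coefficient (`w = 0.7` is hypothetical, not taken from the fitted model):

```python
import numpy as np

w = 0.7                      # hypothetical standardized coefficient
odds_multiplier = np.exp(w)  # one-std-dev increase multiplies the odds by this
print(round(odds_multiplier, 3))  # 2.014
```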
def make_prediction(age, businesstravel, dailyrate, department, distanceFromHome,
                    educationfield, environmentsatisfaction,
                    joblevel, jobrole, jobsatisfaction, maritalstatus,
                    monthlyincome, overtime, relationshipsatisfaction,
                    stockoptionlevel, totalworkingyears, trainingtimeslastyear,
                    worklifebalance, yearsincurrentrole):
    data = {
        'Age': age,
        'BusinessTravel': businesstravel,
        'DailyRate': dailyrate,
        'Department': department,
        'DistanceFromHome': distanceFromHome,
        'EducationField': educationfield,
        'EnvironmentSatisfaction': environmentsatisfaction,
        'JobLevel': joblevel,
        'JobRole': jobrole,
        'JobSatisfaction': jobsatisfaction,
        'MaritalStatus': maritalstatus,
        'MonthlyIncome': monthlyincome,
        'OverTime': overtime,
        'RelationshipSatisfaction': relationshipsatisfaction,
        'StockOptionLevel': stockoptionlevel,
        'TotalWorkingYears': totalworkingyears,
        'TrainingTimesLastYear': trainingtimeslastyear,
        'WorkLifeBalance': worklifebalance,
        'YearsInCurrentRole': yearsincurrentrole
    }
    df = pd.DataFrame(data, index=[0])
    prediction = model.predict_proba(df)[:, 1][0]
    if prediction > 0.4:
        risk = "Strong"
    elif prediction > 0.3:
        risk = "Medium"
    else:
        risk = "Weak"
    return f"This employee has a {risk} risk of quitting, with probability = {round(prediction, 5)}."
interact(
    make_prediction,
    age=IntSlider(min=X_train["Age"].min(), max=X_train["Age"].max(), value=X_train["Age"].mean()),
    businesstravel=Dropdown(options=sorted(X_train["BusinessTravel"].unique())),
    dailyrate=IntSlider(min=X_train["DailyRate"].min(), max=X_train["DailyRate"].max(), value=X_train["DailyRate"].mean()),
    department=Dropdown(options=sorted(X_train["Department"].unique())),
    distanceFromHome=IntSlider(min=X_train["DistanceFromHome"].min(), max=X_train["DistanceFromHome"].max(), value=X_train["DistanceFromHome"].mean()),
    educationfield=Dropdown(options=sorted(X_train["EducationField"].unique())),
    environmentsatisfaction=Dropdown(options=sorted(X_train["EnvironmentSatisfaction"].unique())),
    joblevel=Dropdown(options=sorted(X_train["JobLevel"].unique())),
    jobrole=Dropdown(options=sorted(X_train["JobRole"].unique())),
    jobsatisfaction=Dropdown(options=sorted(X_train["JobSatisfaction"].unique())),
    maritalstatus=Dropdown(options=sorted(X_train["MaritalStatus"].unique())),
    monthlyincome=IntSlider(min=X_train["MonthlyIncome"].min(), max=X_train["MonthlyIncome"].max(), value=X_train["MonthlyIncome"].mean()),
    overtime=Dropdown(options=sorted(X_train["OverTime"].unique())),
    relationshipsatisfaction=Dropdown(options=sorted(X_train["RelationshipSatisfaction"].unique())),
    stockoptionlevel=Dropdown(options=sorted(X_train["StockOptionLevel"].unique())),
    totalworkingyears=IntSlider(min=X_train["TotalWorkingYears"].min(), max=X_train["TotalWorkingYears"].max(), value=X_train["TotalWorkingYears"].mean()),
    trainingtimeslastyear=IntSlider(min=X_train["TrainingTimesLastYear"].min(), max=X_train["TrainingTimesLastYear"].max(), value=X_train["TrainingTimesLastYear"].mean()),
    worklifebalance=Dropdown(options=sorted(X_train["WorkLifeBalance"].unique())),
    yearsincurrentrole=IntSlider(min=X_train["YearsInCurrentRole"].min(), max=X_train["YearsInCurrentRole"].max(), value=X_train["YearsInCurrentRole"].mean()),
);
interactive(children=(IntSlider(value=36, description='age', max=60, min=18), Dropdown(description='businesstr…
